Abstract: We present a two-stage approach for learning dictionaries
for object classification tasks based on the principle
of information maximization. The proposed method seeks a 
dictionary that is compact, discriminative, and generative. In the 
first stage, dictionary atoms are selected from an initial dictionary 
by maximizing the mutual information measure on dictionary 
compactness, discrimination and reconstruction. In the second 
stage, the selected dictionary atoms are updated for improved 
reconstructive and discriminative power using a simple gradient 
ascent algorithm on mutual information. Experiments using real 
datasets demonstrate the effectiveness of our approach for image 
classification tasks. 

Index Terms — Dictionary learning, information theory, mutual 
information, entropy, image classification. 

I. Introduction 

Sparse signal representations have recently gained much
traction in vision, signal and image processing [1], [2], [3],
[4]. This is mainly because signals and images of
interest can be sparse in some dictionary. Given a redundant
dictionary $D$ and a signal $y$, finding a sparse representation
of $y$ in $D$ entails solving the following optimization problem

$$\hat{x} = \arg\min_{x} \|x\|_0 \quad \text{subject to} \quad y = Dx, \tag{1}$$

where the $\ell_0$ sparsity measure $\|x\|_0$ counts the number of
nonzero elements in the vector $x$. Problem (1) is NP-hard, so no
polynomial-time algorithm for it is known. Hence, approximate
solutions are usually sought 0, 0, (6), Q. 

The dictionary D can be either based on a mathematical 
model of the data [3] or it can be trained directly from the
data 0. It has been observed that learning a dictionary directly 
from training rather than using a predetermined dictionary 
(such as wavelet or Gabor) usually leads to better representa- 
tion and hence can provide improved results in many practical 
applications such as restoration and classification 0, 0, 0, 

® 

Various algorithms have been developed for the task of
training a dictionary from examples. One of the most commonly
used algorithms is the K-SVD algorithm [10]. Given
a set of examples $\{y_i\}_{i=1}^{N}$, K-SVD finds a dictionary $D$ that
provides the best representation for each example in this set
by solving the following optimization problem

$$(\hat{D}, \hat{X}) = \arg\min_{D, X} \|Y - DX\|_F^2 \quad \text{subject to} \quad \|x_i\|_0 \le T_0 \;\; \forall i, \tag{2}$$

Q. Qiu, V. M. Patel, and R. Chellappa are with the Center for Automation 
Research, UMIACS, University of Maryland, College Park, MD 20742 USA 
(e-mail: qiu@cs.umd.edu, {pvishalm, rama}@umiacs.umd.edu) 



where $x_i$ represents the $i$-th column of $X$, $Y$ is the matrix
whose columns are $y_i$, and $T_0$ is the sparsity parameter. Here,
the Frobenius norm is defined as $\|A\|_F = \sqrt{\sum_{ij} A_{ij}^2}$. The
K-SVD algorithm alternates between sparse-coding and dictionary
update steps. In the sparse-coding step, $D$ is fixed and
the representation vectors $x_i$ are found for each example $y_i$.
Then, the dictionary is updated atom-by-atom in an efficient
way.
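The sparse-coding step can be illustrated with a greedy pursuit. The following is a minimal sketch of orthogonal matching pursuit in NumPy, not the K-SVD implementation itself; the function name `omp`, the random dictionary and the test signal are all illustrative assumptions.

```python
import numpy as np

def omp(D, y, T):
    """Greedy sparse coding: approximate min ||y - D x||_2 s.t. ||x||_0 <= T (T >= 1)."""
    x = np.zeros(D.shape[1])
    support = []
    residual = y.copy()
    for _ in range(T):
        # Pick the atom most correlated with the current residual.
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = -np.inf  # never reselect an atom
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected coefficients by least squares.
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
        if np.linalg.norm(residual) < 1e-10:
            break
    x[support] = coeffs
    return x

# Hypothetical toy problem: a 3-sparse signal over a random dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((30, 60))
D /= np.linalg.norm(D, axis=0)           # l2-normalized atoms
x_true = np.zeros(60)
x_true[[3, 17, 42]] = [1.5, -2.0, 1.0]
y = D @ x_true
x_hat = omp(D, y, T=3)
```

The returned code has at most `T` nonzero entries, and the least-squares re-fit keeps the residual from growing as atoms are added.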

Dictionaries can be trained for both reconstruction and 
discrimination applications. In the late nineties, Etemad and
Chellappa proposed a linear discriminant analysis (LDA)
based basis selection and feature extraction algorithm for
classification using wavelet packets [11]. Recently, similar
algorithms for simultaneous sparse signal representation and
discrimination have also been proposed in [12], [13], [14],
[15]. Some of the other methods for learning discriminative
dictionaries include QU, fi3, EE E2i EQ|, ED, G2- 
Additional techniques may be found within these references. 

In this paper, we propose a general method for learning
dictionaries for image classification tasks via information
maximization. Unlike previously proposed dictionary
learning methods, which consider only reconstructive
and/or discriminative dictionaries, our algorithm can
learn reconstructive, compact and discriminative dictionaries
simultaneously. Sparse representation over a dictionary with
coherent atoms suffers from the multiple representation problem:
the same signal can admit several different sparse codes. A
compact dictionary consists of incoherent atoms, and encourages
similar signals, which are more likely from the same
class, to be consistently described by a similar set of atoms
with similar coefficients [21]. A discriminative dictionary
encourages signals from different classes to be described by
either a different set of atoms, or the same set of atoms but with
different coefficients [13], [15], ifTTl . The additional
reconstructive requirement on a compact and discriminative
dictionary enhances the robustness of the discriminant sparse
representation [13]. All three criteria are critical for
classification using sparse representation.

Our method of training dictionaries consists of two main 
stages involving greedy atom selection and simple gradient 
ascent atom updates, resulting in a highly efficient algorithm. 
In the first stage, dictionary atoms are selected in a greedy 
way such that the common internal structure of signals be- 
longing to a certain class is extracted while at the same time 
ensuring global discrimination among the different classes. 
In the second stage, the dictionary is updated for improved 
discrimination and reconstruction via a simple gradient ascent 
[Fig. 1 panels: (a) four subjects; (b) sparse representation for images of four subjects (sparsity = 3): our approach (acc: 91.89%), SOMP (acc: 36.03%), MMI-1 (acc: 32.43%), MMI-2 (acc: 38.73%); (c) four digits; (d) sparse representation for four handwritten digits (sparsity = 3): our approach (acc: 93.19%), SOMP (acc: 73.46%), MMI-1 (acc: 70.06%), MMI-2 (acc: 70.06%).]

Fig. 1: Sparse representation using dictionaries learned by different approaches (SOMP [22], MMI-1 and MMI-2 [21]). For
visualization, sparsity 3 is chosen, i.e., no more than three dictionary atoms are allowed in each sparse decomposition. When
signals are represented at once as a linear combination of a common set of atoms, sparse coefficients of all the samples become
points in the same coordinate space. Different classes are represented by different colors. The recognition accuracy is obtained
through linear SVMs on the sparse coefficients. Our approach provides a more discriminative sparse representation, which leads
to significantly better classification accuracy.



method that maximizes the mutual information (MI) between 
the signals and the dictionary, as well as the sparse coefficients 
and the class labels. 

Fig. 1 presents a comparison in terms of the discriminative
power of the information-theoretic dictionary learning
approach presented in this paper with three state-of-the-art
methods. Scatter plots of sparse coefficients obtained using
the different methods show that our method provides a more
discriminative sparse representation, leading to significantly
better classification accuracy.

The organization of the paper is as follows. Section II
defines and formulates the information-theoretic dictionary
learning problem. In Section III, the proposed dictionary learning
algorithm is detailed. Experimental results are presented
in Section IV, and Section V concludes the paper with a brief
summary and discussion.

II. Background and Problem Formulation 

Suppose we are given a set of $N$ signals (images) in an
$n$-dimensional feature space, $Y = [y_1, \ldots, y_N]$, $y_i \in \mathbb{R}^n$. Given that
the signals are from $p$ distinct classes and $N_c$ signals are from
the $c$-th class, $c \in \{1, \ldots, p\}$, we denote $Y = \{Y_c\}_{c=1}^{p}$,
where $Y_c = [y_1^c, \ldots, y_{N_c}^c]$ are the signals in the $c$-th class.
When the class information is relevant, we similarly define
$X = \{X_c\}_{c=1}^{p}$, where $X_c = [x_1^c, \ldots, x_{N_c}^c]$ is the sparse
representation of $Y_c$.

Given a sample $y$ at random, the entropy (uncertainty) of
the class label in terms of class prior probabilities is defined
as

$$H(C) = \sum_{c} p(c) \log \frac{1}{p(c)}.$$

The mutual information, which indicates the decrease in uncertainty
about the pattern $y$ due to the knowledge of the
underlying class label $c$, is defined as

$$I(Y; C) = H(Y) - H(Y|C),$$

where $H(Y|C)$ is the conditional entropy defined as

$$H(Y|C) = \sum_{y, c} p(y, c) \log \frac{1}{p(y|c)}.$$

Given $Y$ and an initial dictionary $D^0$ with $\ell_2$-normalized
columns, we aim to learn a compact, reconstructive
and discriminative dictionary $D^*$ by maximizing the mutual
information between $D^*$ and the unselected atoms $D^0 \setminus D^*$ in
$D^0$, between the sparse codes $X_{D^*}$ associated with $D^*$ and
the signal class labels $C$, and finally between the signals $Y$
and $D^*$, i.e.,

$$\arg\max_{D} \; \lambda_1 I(D; D^0 \setminus D) + \lambda_2 I(X_D; C) + \lambda_3 I(Y; D), \tag{3}$$

where $\{\lambda_1, \lambda_2, \lambda_3\}$ are the parameters that balance the contributions
from the compactness, discriminability and reconstruction
terms, respectively.

It is widely known that the inclusion of additional criteria, such
as a discriminative term, in a dictionary learning framework
often involves challenging optimization algorithms [17], [18],
[12]. As discussed above, the compactness, discriminability and
reconstruction terms are all critical for classification using
sparse representation. Maximizing mutual information provides
a simple way to unify all three criteria for dictionary learning.
As suggested in [23] and [21], maximizing mutual information
can also lead to a submodular objective function, for which a greedy
approach is near-optimal.

A two-stage approach is adopted to satisfy (3). In the first
stage, each term in (3) is maximized in a unified greedy
manner that involves a closed-form evaluation, so atoms can
be greedily selected from the initial dictionary while satisfying
(3). In the second stage, the selected dictionary atoms are
updated using a simple gradient ascent method to further
maximize

$$\lambda_2 I(X_D; C) + \lambda_3 I(Y; D).$$

III. Information-theoretic Dictionary Learning 

In this section, we present the details of our Information- 
theoretic Dictionary Learning (ITDL) approach for classifica- 
tion tasks. The dictionary learning procedure is divided into 
two main steps: Information-theoretic Dictionary Selection 
(ITDS) and Information-theoretic Dictionary Update (ITDU). 
In what follows, we describe these steps in detail. 

A. Dictionary Selection 

Given input signals $Y$ and an initial dictionary $D^0$, we select
a subset of dictionary atoms $D^*$ from $D^0$ via information
maximization, i.e., maximizing (3), to encourage the signals
from the same class to have very similar sparse representations
while retaining discriminative power. In this section, we illustrate
why the three terms in (3) describe the dictionary compactness,
discrimination and representation, respectively. We also show
how each term in (3) can be maximized in a unified greedy
manner that involves closed-form computations. Therefore, if
we start with $D^* = \emptyset$ and greedily select the next best atom
$d^*$ from $D^0 \setminus D^*$ that provides an information increase to
(3), we obtain a set of dictionary atoms that is compact,
reconstructive and discriminative at the same time. To this
end, we consider each term in (3) separately.

1) Dictionary compactness $I(D^*; D^0 \setminus D^*)$: The dictionary
compactness $I(D^*; D^0 \setminus D^*)$ has been studied in our earlier
work [21]. We summarize [21] here to complete our information-driven
dictionary selection discussion. As argued in [21], dictionary
compactness is required to avoid the multiple sparse representation
problem and thereby improve classification performance. In [21],
we first model sparse representation through a Gaussian Process
model to define the mutual information $I(D^*; D^0 \setminus D^*)$.
A compact dictionary can then be obtained as follows: we start
with $D^* = \emptyset$ and iteratively choose the next best dictionary
item $d^*$ from $D^0 \setminus D^*$ that provides the maximum increase in
mutual information, i.e.,

$$\arg\max_{d^* \in D^0 \setminus D^*} \; I(D^* \cup d^*; D^0 \setminus (D^* \cup d^*)) - I(D^*; D^0 \setminus D^*). \tag{4}$$

It has been proved in [23] that the above greedy algorithm
is a polynomial-time approximation that is within $(1 - 1/e)$ of the optimum.
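As a rough illustration of the greedy rule in (4), the sketch below assumes that the atoms are jointly Gaussian with a known covariance matrix `K`, in which case the mutual information gain of adding an atom reduces to a log-ratio of two conditional variances. The squared-exponential kernel, the jitter term and all function names are assumptions made for this example and are not taken from [21] or [23].

```python
import numpy as np

def conditional_variance(K, d, S):
    """Variance of atom d given the atoms indexed by S under a zero-mean Gaussian with covariance K."""
    if len(S) == 0:
        return K[d, d]
    K_SS = K[np.ix_(S, S)]
    K_dS = K[d, S]
    return K[d, d] - K_dS @ np.linalg.solve(K_SS, K_dS)

def greedy_mi_selection(K, k):
    """Greedily pick k atoms; each step maximizes H(d | selected) - H(d | remaining)."""
    n = K.shape[0]
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for d in range(n):
            if d in selected:
                continue
            rest = [j for j in range(n) if j != d and j not in selected]
            # Gaussian entropy difference = half the log-ratio of conditional variances.
            gain = 0.5 * np.log(conditional_variance(K, d, selected)
                                / conditional_variance(K, d, rest))
            if gain > best_gain:
                best, best_gain = d, gain
        selected.append(best)
    return selected

# Hypothetical initial dictionary and an assumed kernel between atoms.
rng = np.random.default_rng(0)
D0 = rng.standard_normal((8, 12))
D0 /= np.linalg.norm(D0, axis=0)
sq_dist = ((D0[:, :, None] - D0[:, None, :]) ** 2).sum(axis=0)
K = np.exp(-0.5 * sq_dist) + 1e-3 * np.eye(12)   # jitter keeps K well conditioned
selected = greedy_mi_selection(K, k=3)
```

Each step favors an atom that is poorly explained by the atoms already chosen but well explained by the atoms left behind, which is the intuition behind compactness.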



2) Dictionary Discrimination $I(X_{D^*}; C)$: Using any pursuit
algorithm such as OMP 0, we initialize the sparse
coefficients $X_{D^0}$ for input signals $Y$ and an initial dictionary
$D^0$. Given that $X_{D^*}$ are the sparse coefficients associated with the
desired set of atoms $D^*$ and $C$ are the class labels for the input
signals $Y$, based on [24], an upper bound on the Bayes error
over the sparse representation $E(X_{D^*})$ is obtained as

$$\frac{1}{2} \left( H(C) - I(X_{D^*}; C) \right).$$

This bound is minimized when $I(X_{D^*}; C)$ is maximized.
Thus, a discriminative dictionary $D^*$ is obtained via

$$\arg\max_{D^*} I(X_{D^*}; C). \tag{5}$$

We maximize (5) using a greedy algorithm initialized by $D^* = \emptyset$
and iteratively choosing the next best dictionary atom $d^*$
from $D^0 \setminus D^*$ that provides the maximum mutual information
increase, i.e.,

$$\arg\max_{d^* \in D^0 \setminus D^*} I(X_{D^* \cup d^*}; C) - I(X_{D^*}; C), \tag{6}$$

where $I(X_{D^*}; C)$ is evaluated as follows

$$\begin{aligned} I(X_{D^*}; C) &= H(X_{D^*}) - H(X_{D^*}|C) \\ &= H(X_{D^*}) - \sum_{c=1}^{p} p(c) H(X_{D^*}|c). \end{aligned} \tag{7}$$

Entropy measures in (7) involve computation of the probability
density functions $p(X_{D^*})$ and $p(X_{D^*}|c)$. We adopt the kernel
density estimation method [25] to non-parametrically estimate
the probability densities. Using isotropic Gaussian kernels (i.e.,
$\Sigma = \sigma^2 I$, where $I$ is the identity matrix), the class-dependent
density for the $c$-th class can be estimated as

$$p(x|c) = \frac{1}{N_c} \sum_{j=1}^{N_c} K_G(x - x_j^c, \sigma^2 I), \tag{8}$$

where $K_G$ is a $d$-dimensional Gaussian kernel defined as

$$K_G(x, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} x^T \Sigma^{-1} x \right). \tag{9}$$

With $p(c) = \frac{N_c}{N}$, we can estimate $p(x)$ as

$$p(x) = \sum_{c} p(x|c) p(c).$$
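The decomposition in (7) with the kernel density estimates above can be sketched in a few lines. This example assumes that each entropy is estimated by resubstitution, i.e., by averaging the negative log-density over the samples themselves; that choice, the function names and the toy data are assumptions of this sketch. Samples are stored one per row here for convenience.

```python
import numpy as np

def log_kde(points, centers, sigma):
    """Log of the isotropic Gaussian KDE built on `centers`, evaluated at `points` (one sample per row)."""
    d = points.shape[1]
    sq = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    log_k = -0.5 * sq / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)
    m = log_k.max(axis=1, keepdims=True)             # log-sum-exp for stability
    return (m + np.log(np.exp(log_k - m).mean(axis=1, keepdims=True))).ravel()

def kde_mutual_information(X, labels, sigma):
    """I(X;C) = H(X) - sum_c p(c) H(X|c), with p(c) = N_c / N."""
    h_x = -log_kde(X, X, sigma).mean()
    h_x_given_c = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        h_x_given_c += (len(Xc) / len(X)) * -log_kde(Xc, Xc, sigma).mean()
    return h_x - h_x_given_c

# Hypothetical 2-D coefficients: two well-separated classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-6, 0], 1.0, (100, 2)),
               rng.normal([6, 0], 1.0, (100, 2))])
labels = np.repeat([0, 1], 100)
mi_separated = kde_mutual_information(X, labels, sigma=0.5)
mi_shuffled = kde_mutual_information(X, rng.permutation(labels), sigma=0.5)
```

For two equally likely, well-separated classes the estimate approaches $\log 2$, the entropy of the label, while shuffling the labels removes most of the information.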

3) Dictionary Representation $I(Y; D^*)$: A representative
dictionary $D^*$ maximizes the mutual information between
dictionary atoms and the signals, i.e.,

$$\arg\max_{D^*} I(Y; D^*). \tag{10}$$

We obtain a representative dictionary in a similar greedy
manner as discussed above. That is, we iteratively choose the
next best dictionary atom $d^*$ from $D^0 \setminus D^*$ that provides the
maximum increase in mutual information,

$$\arg\max_{d^* \in D^0 \setminus D^*} I(Y; D^* \cup d^*) - I(Y; D^*). \tag{11}$$
By assuming the signals are drawn independently and using
the chain rule of entropies, we can evaluate $I(Y; D^*)$ as

$$\begin{aligned} I(Y; D^*) &= H(Y) - H(Y|D^*) \\ &= H(Y) - \sum_{i=1}^{N} H(y_i|D^*). \end{aligned} \tag{12}$$

$H(Y)$ is independent of dictionary selection and can be
ignored. To evaluate $H(y_i|D^*)$ in (12), we define $p(y_i|D^*)$
through the following relation holding for each input signal
$y_i$,

$$y_i = D^* x_i + r_i,$$

where $r_i$ is a Gaussian residual vector with variance $\sigma_r^2$. Such
a relation can be written in a probabilistic form as

$$p(y_i|D^*) \propto \exp\left( -\frac{1}{2\sigma_r^2} \|y_i - D^* x_i\|^2 \right).$$

4) Selection of $\lambda_1$, $\lambda_2$ and $\lambda_3$: The parameters $\lambda_1$, $\lambda_2$ and
$\lambda_3$ in (3) are data dependent and can be estimated as the ratio
between the maximal information gained from an atom to
the respective compactness, discrimination and reconstruction
measure, i.e.,

$$\lambda_1 = 1, \quad \lambda_2 = \frac{\max_i I(X_{d_i}; C)}{\max_i I(d_i; D^0 \setminus d_i)}, \quad \lambda_3 = \frac{\max_i I(Y; d_i)}{\max_i I(d_i; D^0 \setminus d_i)}. \tag{13}$$

For each term in (3), only the first greedily selected atom
based on (4), (6) and (11), respectively, is involved in parameter
estimation. This leads to an efficient process for finding
the parameters.

B. Dictionary Update 

A representative and discriminative dictionary $D$ produces
the maximal MI between the sparse coefficients and the class
labels, as well as between the signals and the dictionary, i.e.,

$$\max_{D} \; \lambda_2 I(X_D; C) + \lambda_3 I(Y; D).$$

In the dictionary update stage, we update the set of selected
dictionary atoms $D$ to further enhance discriminability and
representation.

To achieve sparsity, we assume the cardinality of the set
of selected atoms $D$ is much smaller than the dimension of
the signal feature space. Under such an assumption, the sparse
representation of the signals $Y$ can be obtained as $X_D = D^{\dagger} Y$,
which minimizes the representation error $\|Y - D X_D\|_F^2$,
where

$$D^{\dagger} = (D^T D)^{-1} D^T.$$

Thus, updating dictionary atoms for improving discriminability
while maintaining representation is transformed into finding
$D^{\dagger}$ that maximizes

$$I(D^{\dagger} Y; C).$$
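The pseudo-inverse coding step is easy to check numerically. The sketch below uses a hypothetical random dictionary with far fewer atoms than signal dimensions and confirms that $(D^T D)^{-1} D^T Y$ coincides with the least-squares solution; the sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, N = 20, 5, 8                       # signal dimension, number of atoms, number of signals
D = rng.standard_normal((n, T))
D /= np.linalg.norm(D, axis=0)           # l2-normalized atoms, full column rank
Y = rng.standard_normal((n, N))

# Closed-form pseudo-inverse, valid when D has full column rank.
D_pinv = np.linalg.inv(D.T @ D) @ D.T
X = D_pinv @ Y

# Reference: the least-squares minimizer of ||Y - D X||_F^2.
X_lstsq, *_ = np.linalg.lstsq(D, Y, rcond=None)
```

In practice `np.linalg.pinv(D)` is the numerically safer way to form the same matrix, since it avoids inverting $D^T D$ explicitly.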



1) A Differentiable Objective Function: To enable a simple
gradient ascent method for dictionary update, we first approximate
$I(D^{\dagger} Y; C)$ using a differentiable objective function.
$I(X; C)$ can be viewed as the Kullback-Leibler (KL)
divergence $D(p\|q)$ between $p(X, C)$ and $p(X)p(C)$, where
$X = D^{\dagger} Y$. Motivated by [26], we approximate the KL divergence
$D(p\|q)$ with the quadratic divergence (QD), defined
as

$$Q(p\|q) = \int_t (p(t) - q(t))^2 \, dt,$$

making $I(X; C)$ differentiable. Due to the property that

$$D(p\|q) \ge \frac{1}{2} Q(p\|q),$$

by maximizing the QD, one can also maximize a lower bound
to the KL divergence. With QD, $I(X; C)$ can now be evaluated
as

$$\begin{aligned} I_Q(X; C) = {} & \sum_{c} \int_x p(x, c)^2 \, dx \\ & - 2 \sum_{c} \int_x p(x, c) p(x) p(c) \, dx \\ & + \sum_{c} \int_x p(x)^2 p(c)^2 \, dx. \end{aligned} \tag{14}$$

In order to evaluate the individual terms in (14), we need to
derive expressions for the kernel density estimates of the various
density terms appearing in (14). Observe that for two
Gaussian kernels as in (9), the following holds

$$\int_x K_G(x - s_i, \Sigma_1) K_G(x - s_j, \Sigma_2) \, dx = K_G(s_i - s_j, \Sigma_1 + \Sigma_2). \tag{15}$$
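The Gaussian product identity (15) can be verified numerically in one dimension. The centers, variances and integration grid below are arbitrary values chosen for this sketch.

```python
import numpy as np

def gauss(x, var):
    """1-D zero-mean Gaussian kernel with variance `var`."""
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

s_i, s_j, var1, var2 = 0.7, -1.2, 0.5, 1.5
x = np.linspace(-30.0, 30.0, 200001)
dx = x[1] - x[0]

# Left-hand side of (15): Riemann-sum approximation of the integral of the product.
lhs = np.sum(gauss(x - s_i, var1) * gauss(x - s_j, var2)) * dx
# Right-hand side of (15): a single kernel with the summed variance.
rhs = gauss(s_i - s_j, var1 + var2)
```

This identity is what turns every integral in (14) into a sum of pairwise kernel evaluations with variance $2\sigma^2$.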

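Using (8), $p(c) = \frac{N_c}{N}$ and $p(x, c) = p(x|c) p(c)$, we have

$$p(x, c) = \frac{1}{N} \sum_{j=1}^{N_c} K_G(x - x_j^c, \sigma^2 I).$$

Similarly, since $p(x) = \sum_{c} p(x, c)$, we have

$$p(x) = \frac{1}{N} \sum_{i=1}^{N} K_G(x - x_i, \sigma^2 I).$$

Inserting the expressions for $p(x, c)$ and $p(x)$ into (14) and using
(15), we get the following closed form

\begin{align}
I_Q(X; C) = {} & \frac{1}{N^2} \sum_{c=1}^{p} \sum_{k=1}^{N_c} \sum_{l=1}^{N_c} K_G(x_k^c - x_l^c, 2\sigma^2 I) \tag{16} \\
& - \frac{2}{N^2} \sum_{c=1}^{p} \frac{N_c}{N} \sum_{j=1}^{N_c} \sum_{k=1}^{N} K_G(x_j^c - x_k, 2\sigma^2 I) \notag \\
& + \frac{1}{N^2} \left( \sum_{c=1}^{p} \left( \frac{N_c}{N} \right)^2 \right) \sum_{k=1}^{N} \sum_{l=1}^{N} K_G(x_k - x_l, 2\sigma^2 I). \tag{17}
\end{align}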
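The closed form above needs only pairwise kernel evaluations. The following sketch computes the three sums with NumPy, storing samples as columns as in the text; the function names and the 1-D toy data are assumptions of this example.

```python
import numpy as np

def gaussian_gram(X, var):
    """Matrix of K_G(x_i - x_j, var * I) for all pairs of columns of X."""
    d = X.shape[0]
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-0.5 * sq / var) / (2 * np.pi * var) ** (d / 2)

def quadratic_mi(X, labels, sigma):
    """Quadratic mutual information between the columns of X and their labels."""
    N = X.shape[1]
    G = gaussian_gram(X, 2 * sigma**2)        # kernels of variance 2*sigma^2, from (15)
    v_in, v_btw, prior_sq = 0.0, 0.0, 0.0
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        p_c = len(idx) / N
        v_in += G[np.ix_(idx, idx)].sum()     # within-class pairs
        v_btw += p_c * G[idx, :].sum()        # class-c samples against all samples
        prior_sq += p_c**2
    v_all = prior_sq * G.sum()                # all pairs, weighted by sum of squared priors
    return (v_in - 2.0 * v_btw + v_all) / N**2

# Hypothetical 1-D coefficients for two classes.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(-2.0, 1.0, (1, 20)), rng.normal(2.0, 1.0, (1, 20))])
labels = np.repeat([0, 1], 20)
sigma = 0.5
iq = quadratic_mi(X, labels, sigma)
```

The cost is quadratic in the number of samples, since the full Gram matrix is formed once and then summed over index subsets.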
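2) Gradient Ascent Update: For simplicity, we define a new
matrix $\Phi$ as

$$\Phi \triangleq (D^{\dagger})^T.$$

Once we have estimated $I_Q(X; C)$ as a function of the data set
in a differentiable form, where $X = \Phi^T Y$, we can use gradient
ascent on $I_Q(X; C)$ to search for the optimal $\Phi$ maximizing
the quadratic mutual information with

$$\Phi_{k+1} = \Phi_k + \nu \left. \frac{\partial I_Q}{\partial \Phi} \right|_{\Phi = \Phi_k},$$

where $\nu > 0$ defines the step size, and

$$\frac{\partial I_Q}{\partial \Phi} = \sum_{c=1}^{p} \sum_{i=1}^{N_c} \frac{\partial I_Q}{\partial x_i^c} \frac{\partial x_i^c}{\partial \Phi}.$$

Since $x_i^c = \Phi^T y_i^c$, we get

$$\frac{\partial x_i^c}{\partial \Phi} = (y_i^c)^T.$$

Note that

$$\frac{\partial}{\partial x_i} K_G(x_i - x_j, 2\sigma^2 I) = K_G(x_i - x_j, 2\sigma^2 I) \frac{(x_j - x_i)}{2\sigma^2}.$$

We have

$$\begin{aligned} \frac{\partial I_Q}{\partial x_i^c} = {} & \frac{1}{N^2 \sigma^2} \sum_{k=1}^{N_c} K_G(x_k^c - x_i^c, 2\sigma^2 I)(x_k^c - x_i^c) \\ & + \frac{1}{N^2 \sigma^2} \left( \sum_{c'=1}^{p} \left( \frac{N_{c'}}{N} \right)^2 \right) \sum_{k=1}^{N} K_G(x_k - x_i^c, 2\sigma^2 I)(x_k - x_i^c) \\ & - \frac{1}{N^2 \sigma^2} \sum_{c'=1}^{p} \frac{N_{c'} + N_c}{N} \sum_{j=1}^{N_{c'}} K_G(x_j^{c'} - x_i^c, 2\sigma^2 I)(x_j^{c'} - x_i^c). \end{aligned} \tag{18}$$

Once $\Phi$ is updated, the dictionary $D$ can be updated using
the relation $D = (\Phi^{\dagger})^T$. Such dictionary updates guarantee
convergence to a local maximum due to the fact that the
quadratic divergence is bounded [27].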
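The update can be sketched as follows, assuming the quadratic mutual information is written as a weighted sum of pairwise Gaussian kernels so that its gradient with respect to each coefficient vector follows from the kernel derivative given above. The function names, toy data, step size and iteration count are illustrative assumptions and are not the settings used in the experiments.

```python
import numpy as np

def qmi_and_grad(X, labels, sigma):
    """Quadratic MI of the columns of X and the d x N matrix of gradients dI_Q/dx_i."""
    d, N = X.shape
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    G = np.exp(-sq / (4 * sigma**2)) / (4 * np.pi * sigma**2) ** (d / 2)  # variance 2*sigma^2
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    prior = (counts / N)[inverse]                      # p(c_i) for every sample i
    same = (inverse[:, None] == inverse[None, :]).astype(float)
    # Pairwise weights so that I_Q = (1/N^2) * sum_ij W_ij * G_ij.
    W = same + np.sum((counts / N) ** 2) - (prior[:, None] + prior[None, :])
    value = (W * G).sum() / N**2
    coef = W * G / (N**2 * sigma**2)
    # Column i is sum_j coef_ij * (x_j - x_i).
    grad_X = X @ coef.T - X * coef.sum(axis=1)
    return value, grad_X

def itdu_step(Phi, Y, labels, sigma, step):
    """One gradient ascent step on Phi, where X = Phi^T Y."""
    value, grad_X = qmi_and_grad(Phi.T @ Y, labels, sigma)
    return Phi + step * (Y @ grad_X.T), value          # chain rule: dI/dPhi = Y (dI/dX)^T

# Hypothetical data: two classes in a 6-dimensional space, two selected atoms.
rng = np.random.default_rng(0)
Y = np.hstack([rng.normal(-1.0, 1.0, (6, 15)), rng.normal(1.0, 1.0, (6, 15))])
labels = np.repeat([0, 1], 15)
sigma = 1.0
D = rng.standard_normal((6, 2))
D /= np.linalg.norm(D, axis=0)
Phi0 = np.linalg.pinv(D).T
Phi = Phi0.copy()
for _ in range(20):
    Phi, value = itdu_step(Phi, Y, labels, sigma, step=0.1)
D_updated = np.linalg.pinv(Phi).T                      # map Phi back to a dictionary
```

A fixed step size is used here for brevity; a line search or a decaying step would be a natural refinement when the objective oscillates.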

C. Dictionary Learning Framework 

Given a dictionary $D^0$, a set of signals $Y$, the class labels
$C$ and a sparsity level $T$, the supervised sparse coding method
given in Algorithm 1 represents these signals at once as
a linear combination of a common subset of $T$ atoms in
$D^0$, where $T$ is much smaller than the dimension of the
signal feature space to achieve sparsity. We obtain a sparse
representation, as each signal has no more than $T$ coefficients
in its decomposition. The advantage of simultaneous sparse
decomposition for classification has been discussed in [13].
Such simultaneous decompositions extract the internal structure
of the given signals and neglect minor intra-class variations.
The ITDS stage in Algorithm 1 ensures that such a common set of
atoms is compact, discriminative and reconstructive.

When the internal structures of signals from different classes
cannot be well represented in a common linear subspace,
Algorithm 2 illustrates supervised sparse coding with a dedicated
set of atoms per class. Note that in Algorithm 2 both the
discriminative and reconstructive terms in ITDS are handled
on a class-by-class basis.



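Input: Dictionary $D^0$, signals $Y$, class labels $C$, sparsity level $T$
Output: sparse coefficients $X$, reconstruction $\hat{Y}$
begin
    Initialization stage:
    1. Initialize $X$ with any pursuit algorithm: for $i = 1, \ldots, N$,
       $\min_{x_i} \|y_i - D^0 x_i\|_2^2$ s.t. $\|x_i\|_0 \le T$.
    ITDS stage (shared atoms):
    2. Estimate $\lambda_1$, $\lambda_2$ and $\lambda_3$ from $Y$, $X$ and $C$;
    3. Find the $T$ most compact, discriminative and reconstructive atoms:
       $D^* \leftarrow \emptyset$; $\Gamma \leftarrow \emptyset$;
       for $t = 1$ to $T$ do
           $d^* \leftarrow \arg\max_{d \in D^0 \setminus D^*} \lambda_1 [I(D^* \cup d; D^0 \setminus (D^* \cup d)) - I(D^*; D^0 \setminus D^*)] + \lambda_2 [I(X_{D^* \cup d}; C) - I(X_{D^*}; C)] + \lambda_3 [I(Y; D^* \cup d) - I(Y; D^*)]$;
           $D^* \leftarrow D^* \cup d^*$;
           $\Gamma \leftarrow \Gamma \cup \gamma^*$, where $\gamma^*$ is the index of $d^*$ in $D^0$;
       end
    4. Compute sparse codes and reconstructions:
       $X \leftarrow \mathrm{pinv}(D^*) Y$;
       $\hat{Y} \leftarrow D^* X$;
    5. return $X$, $\hat{Y}$, $D^*$, $\Gamma$;
end

Algorithm 1: Sparse coding with global atoms.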
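The control flow of the ITDS stage in Algorithm 1 can be sketched as a generic greedy loop over pluggable gain terms. In the example below only one term is used, and it is a stand-in: the drop in squared reconstruction residual replaces the information-theoretic gains, which is an assumption made to keep the sketch short. All names are hypothetical.

```python
import numpy as np

def greedy_select(n_atoms, T, gain_terms, weights):
    """Pick T atom indices; each step maximizes the weighted sum of the gain terms."""
    selected = []
    for _ in range(T):
        candidates = [d for d in range(n_atoms) if d not in selected]
        scores = [sum(w * g(selected, d) for w, g in zip(weights, gain_terms))
                  for d in candidates]
        selected.append(candidates[int(np.argmax(scores))])
    return selected

def make_residual_gain(D0, Y):
    """Stand-in gain: decrease in squared residual when atom d joins the selected set."""
    def residual(atoms):
        if not atoms:
            return np.sum(Y**2)
        A = D0[:, atoms]
        return np.sum((Y - A @ (np.linalg.pinv(A) @ Y)) ** 2)
    return lambda selected, d: residual(selected) - residual(selected + [d])

# Toy case: the signals live in the span of atoms 2 and 5 of an identity dictionary.
D0 = np.eye(6)
Y = np.array([[0, 0, 3, 0, 0, 1],
              [0, 0, 2, 0, 0, -1]], dtype=float).T     # 6 x 2, one signal per column
Gamma = greedy_select(6, 2, [make_residual_gain(D0, Y)], [1.0])

# Step 4 of Algorithm 1: codes and reconstructions from the selected atoms.
D_star = D0[:, Gamma]
X = np.linalg.pinv(D_star) @ Y
Y_hat = D_star @ X
```

Adding the compactness and discrimination gains amounts to passing two more callables and their weights to `greedy_select`.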



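Input: Dictionary $D^0$, signals $Y = \{Y_c\}_{c=1}^{p}$, sparsity level $T$
Output: sparse coefficients $\{X_c\}_{c=1}^{p}$, reconstruction $\{\hat{Y}_c\}_{c=1}^{p}$
begin
    Initialization stage:
    1. Initialize $X$ with any pursuit algorithm: for $i = 1, \ldots, N$,
       $\min_{x_i} \|y_i - D^0 x_i\|_2^2$ s.t. $\|x_i\|_0 \le T$.
    ITDS stage (dedicated atoms):
    for $c = 1$ to $p$ do
        2. $C_c \leftarrow \{c_i \,|\, c_i = 1 \text{ if } y_i \in Y_c, 0 \text{ otherwise}\}$;
        3. Estimate $\lambda_1$, $\lambda_2$ and $\lambda_3$ from $Y_c$, $X$ and $C_c$;
        4. Find the $T$ most compact, discriminative and
           reconstructive atoms for class $c$:
           $D^* \leftarrow \emptyset$; $\Gamma \leftarrow \emptyset$;
           for $t = 1$ to $T$ do
               $d^* \leftarrow \arg\max_{d \in D^0 \setminus D^*} \lambda_1 [I(D^* \cup d; D^0 \setminus (D^* \cup d)) - I(D^*; D^0 \setminus D^*)] + \lambda_2 [I(X_{D^* \cup d}; C_c) - I(X_{D^*}; C_c)] + \lambda_3 [I(Y_c; D^* \cup d) - I(Y_c; D^*)]$;
               $D^* \leftarrow D^* \cup d^*$;
               $\Gamma \leftarrow \Gamma \cup \gamma^*$, where $\gamma^*$ is the index of $d^*$ in $D^0$;
           end
           $D_c^* \leftarrow D^*$; $\Gamma_c \leftarrow \Gamma$;
        5. Compute sparse codes and reconstructions:
           $X_c \leftarrow \mathrm{pinv}(D_c^*) Y_c$;
           $\hat{Y}_c \leftarrow D_c^* X_c$;
    end
    6. return $\{X_c\}_{c=1}^{p}$, $\{\hat{Y}_c\}_{c=1}^{p}$, $\{D_c^*\}_{c=1}^{p}$, $\{\Gamma_c\}_{c=1}^{p}$;
end

Algorithm 2: Sparse coding with atoms per class.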



A sparse dictionary learning framework, such as K-SVD
[10], which learns a dictionary that minimizes the reconstruction
error, usually consists of sparse coding and update stages.
In K-SVD, at the coding stage, a pursuit algorithm is employed 
to select a set of atoms for each signal; and at the update stage, 
Input: Dictionary $D^0$, signals $Y = \{Y_c\}_{c=1}^{p}$, class labels $C$,
       sparsity level $T$, update step $\nu$
Output: Learned dictionary $D$, sparse coefficients $X$,
        reconstruction $\hat{Y}$
begin
    Sparse coding stage:
    Use supervised sparse coding to obtain $\{D_c^*\}_{c=1}^{p}$.
    ITDU stage:
    foreach class $c$ do
        [In the shared atom case, use the global label $C$
        instead of $C_c$, and only one iteration is required as the same
        $D^*$ is used for all classes.]
        $C_c \leftarrow \{c_i \,|\, c_i = 1 \text{ if } y_i \in Y_c, 0 \text{ otherwise}\}$;
        $\Phi_1 \leftarrow \mathrm{pinv}(D_c^*)^T$;
        $X \leftarrow \mathrm{pinv}(D_c^*) Y$;
        repeat
            $\Phi_{k+1} = \Phi_k + \nu \left. \frac{\partial I_Q(X; C_c)}{\partial \Phi} \right|_{\Phi = \Phi_k}$;
            $D^* \leftarrow \mathrm{pinv}(\Phi_{k+1})^T$;
            $X \leftarrow \mathrm{pinv}(D^*) Y$;
        until convergence;
        $D_c^* \leftarrow D^*$;
    end
    foreach class $c$ do
        $X_c \leftarrow \mathrm{pinv}(D_c^*) Y_c$;
        $\hat{Y}_c \leftarrow D_c^* X_c$;
    end
    return $\{X_c\}_{c=1}^{p}$, $\{\hat{Y}_c\}_{c=1}^{p}$, $\{D_c^*\}_{c=1}^{p}$;
end

Algorithm 3: Sparse coding with atom updates.



the selected atoms are updated through SVD for improved
reconstruction. Similarly, in Algorithm 3, at the coding stage,
ITDS is employed to select a set of atoms for each class of
signals; and at the update stage, the selected atoms are updated
through ITDU for improved reconstruction and discrimination.
Algorithm 3 is also applicable to the case when sparse coding
is achieved using global atoms.

IV. Experimental Evaluation 

This section presents an experimental evaluation on three
public datasets: the Extended YaleB face dataset [28], the
USPS handwritten digits dataset [29], and the 15-Scenes
dataset [30]. The Extended YaleB dataset contains 2414 frontal
face images of 38 individuals. This dataset is challenging
due to varying illumination conditions and expressions. The
USPS dataset consists of 8-bit 16x16 images of "0" through
"9", with 1100 examples for each class. The 15-Scenes dataset
contains 4485 images falling into 15 scene categories. The 15
categories include images of living rooms, kitchens, streets,
industrial scenes, etc. In all of our experiments, linear SVMs on
the sparse coefficients are used as classifiers. First, we
thoroughly evaluate the basic behaviors of the proposed dictionary
learning method. Then we evaluate the discriminative power
of the ITDL dictionary over the full Extended YaleB dataset,
the full USPS dataset, and the 15-Scenes dataset.

A. Evaluation with Illustrative Examples 

To enable visualized illustrations, we conduct the first set of 
experiments on the first four subjects in the Extended YaleB 



face dataset and the first four digits in the USPS digit dataset. 
Half of the data are used for training and the rest is used for 
testing. 

1) Comparing Atom Selection Methods: We initialize a
dictionary of 128 atoms using the K-SVD algorithm [10] on the
training face images of the first four subjects in the Extended
YaleB dataset. A K-SVD dictionary only minimizes the reconstruction
error and is not yet optimal for classification tasks.
Though one can also initialize the dictionary directly with
training samples or even with random noise, a better initial
dictionary generally helps ITDL in terms of classification
performance, because an ITDL dictionary converges
only to a local maximum.

In Fig. 2, we present the recognition accuracy and the
reconstruction error with different sparsity on the first four
subjects in the Extended YaleB dataset. The Root Mean Square
Error (RMSE) is employed to measure the reconstruction error.
To illustrate the impact of the compactness, discrimination
and reconstruction terms in (3), we keep one term at a
time for the three selection approaches, i.e., the compact, the
discriminative and the reconstructive method. The compact
method is equivalent to MMI-1 [21].

Parameters $\lambda_1$, $\lambda_2$ and $\lambda_3$ in (3) are estimated as discussed
in Section III-A4. As the dictionary learning criteria become
less critical when sparsity increases, i.e., when more of the signal
energy is preserved, we focus on the curves in Fig. 2
with sparsity < 20. Although sparse coding methods generally
perform well for face recognition, it is still easy to notice
that the proposed ITDS method using all three terms (red)
significantly outperforms those that optimize just one of the
three terms, compactness (black), discrimination (blue), and
representation (green), in terms of recognition accuracy. For
example, the discrimination term alone (blue) leads to better
initial but poor overall recognition performance. The proposed
ITDS method also provides moderate reconstruction error.

Note that ITDS exhibits comparable recognition accuracy
to MMI-2 (pink) [21] with global atoms, and significantly
outperforms it with class-dedicated atoms. The reason is
that, instead of explicitly considering the discriminability of
dictionary atoms, MMI-2 enforces the diversity of classes
associated with atoms. Such a class diversity criterion becomes
less effective when there are only two classes, as in the dedicated
atom case. In Fig. 2, it is interesting to note that the reconstructive
method delivers nearly identical recognition accuracy
and RMSE to SOMP [22] with both the shared and dedicated
atoms, despite the different formulations of the two methods. The
proposed dictionary selection using all three terms provides a
good starting point for convergence to a local optimum at the dictionary update stage.

2) Enhanced Discriminability with Atom Update: We illustrate
how the discriminability of dictionary atoms selected by
the ITDS method can be further enhanced using the proposed
ITDU method. We initialize a K-SVD dictionary of 128 atoms
for the face images and a K-SVD dictionary of 64 atoms for the
digit images. Sparsity 2 is adopted for visualization, as the
non-zero sparse coefficients of each image can now be plotted
as a 2-D point. In Fig. 3, with a common set of atoms shared
over all classes, sparse coefficients of all samples become
points in the same 2-D coordinate space. Different classes are
Fig. 2: Recognition accuracy and RMSE on the YaleB dataset using different dictionary selection methods. We vary the sparsity
level, i.e., the maximal number of dictionary atoms that are allowed in each sparse decomposition. In (a) and (b), a global set
of common atoms is selected for all classes. In (c) and (d), a dedicated set of atoms is selected per class. In both cases, the
proposed ITDS (red lines) provides the best recognition performance and moderate reconstruction error.




[Fig. 3 panels: (a) Before update; (b) After 100 updates; (c) Converge after 489 updates; (d) Before update; (e) After 50 updates; (f) Converge after 171 updates.]



Fig. 3: Information-theoretic dictionary update with global atoms shared over classes. For a better visual representation, sparsity
2 is chosen and a randomly selected subset of all samples is shown. The recognition rates associated with (a), (b), and (c) are
30.63%, 42.34% and 51.35%. The recognition rates associated with (d), (e), and (f) are 73.54%, 84.45% and 87.75%. Note
that the proposed ITDU effectively enhances the discriminability of the set of common atoms.



represented by different colors. The original images are also
shown and placed at the coordinates defined by their non-zero
sparse coefficients. The atoms to be updated in Fig. 3a and
3d are selected using ITDS. We can see from Fig. 3 that the
proposed ITDU method makes sparse coefficients of different
classes more discriminative, leading to significantly improved
classification accuracy. Fig. 4 shows that the ITDU method
also enhances the discriminability of atoms dedicated to each
class. Note that, though the dictionary update sometimes
converges only after a considerable number of iterations, in
our experience the first 50 to 100 iterations generally bring
significant improvement in classification accuracy.

3) Enhanced Reconstruction with Atom Update: From
Fig. 5e, we notice obvious errors in the reconstructed digits,
shown in Fig. 5d, with atoms selected from the initial K-SVD
dictionary using ITDS. After 30 ITDU iterations, Fig. 5f shows
that all digits are reconstructed correctly with a unified intra-class
structure and limited intra-class variation. This leads to a
more accurate classification, as shown in Fig. 4. Note that
Fig. 5 and Fig. 4 are results from the same set of experiments.
[Fig. 4 panels: (a) Before dictionary update (Acc. = 85.71%); (c) Converge after 57 update iterations (Acc. = 90.47%).]



Fig. 4: Information-theoretic dictionary update with dedicated atoms per class. The first four digits in the USPS digit dataset 
are used. Sparsity 2 is chosen for visualization. In each figure, signals are first represented at once as a linear combination of 
the dedicated atoms for the class colored by red, then sparse coefficients of all signals are plotted in the same 2-D coordinate 
space. The proposed ITDU effectively enhances the discriminability of the set of dedicated atoms. 



TABLE I: Classification rate (%) on the USPS dataset.

| Proposed | SDL-D [18] | SRSC [15] | FDDL [12] | k-NN | SVM-Gauss |
| 98.28 | 96.44 | 93.95 | 96.31 | 94.80 | 95.80 |



TABLE II: Classification rate (%) on the 15 scenes dataset.

| Proposed | ScSPM [31] | KSPM [30] | KC [32] | LSPM [31] |
| 81.13 | 80.28 | 76.73 | 76.67 | 65.32 |



TABLE III: Classification rate (%) on the Extended YaleB face dataset.

| Proposed | D-KSVD [19] | LC-KSVD [20] | K-SVD [10] | SRC [33] | LLC [34] |
| 95.39 | 94.10 | 95.00 | 93.1 | 80.5 | 90.7 |



As can be seen from Fig. 5g, after ITDU converges, all digits
are reconstructed correctly with the true underlying intra-class
structures, i.e., the left-slanted and right-slanted styles for both
digits "1" and "0". Fig. 5h shows the images in Fig. 5d with
60% missing pixels. The recognition rates for Fig. 5i, Fig. 5j
and Fig. 5k are 76.87%, 85.03% and 85.71%, respectively.



B. Discriminability of ITDL Dictionaries 

We evaluate the discriminative power of ITDL dictionaries
on three datasets: the complete USPS dataset, where we use
7291 images for training and 2007 images for testing; the
Extended YaleB face dataset, where we randomly select half of
the images for training and use the other half for testing; and
the 15-Scenes dataset, where we randomly select 100 images per
class for training and use the remaining images for testing.

For each dataset, we initialize a dictionary of 512 atoms with
K-SVD and set the sparsity to 30. We then perform 30 iterations
of dictionary update and report the peak classification
performance. Here we adopt a dedicated set of atoms for each
class and input the concatenated sparse representation into a
linear SVM classifier. For the Extended YaleB face dataset,
we adopt the same experimental setup as in [20]. As shown in
Table I, Table II and Table III, our method is comparable
to some of the competitive discriminative dictionary learning


(a) Before update (b) 30 iterations (c) 57 iterations

Fig. 5: Reconstruction using class-dedicated atoms with the proposed dictionary update (sparsity 2 is used). (a), (b) and (c)
show the updated dictionary atoms, where from top to bottom the two atoms in each row are the dedicated atoms for
classes '1', '2', '3' and '0'. (e), (f) and (g) show the reconstructions of (d). (i), (j) and (k) show the reconstructions of (h). (h) shows
the images in (d) with 60% missing pixels. Note that ITDU extracts the common internal structure of each class and eliminates
the variation within the class, which leads to more accurate classification.



algorithms such as SDL-D [18], SRSC [15], D-KSVD [19]
and LC-KSVD [20]. Note that our method is flexible enough
to be applied on top of any dictionary learning scheme to
enhance its discriminability.
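
The feature construction described above (sparse coding over a dedicated set of atoms per class, followed by concatenation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the helper names `omp` and `class_dedicated_features`, the choice of orthogonal matching pursuit [6] as the sparse coder, and the random toy data are all assumptions made for the example.

```python
import numpy as np

def omp(D, y, k):
    """Orthogonal matching pursuit: approximate y with at most k atoms (columns) of D."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the unused atom most correlated with the current residual.
        corr = np.abs(D.T @ residual)
        if support:
            corr[support] = -np.inf
        support.append(int(np.argmax(corr)))
        # Re-fit all selected coefficients by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

def class_dedicated_features(class_dicts, Y, k):
    """Code each column of Y over every class dictionary and concatenate the codes."""
    feats = []
    for y in Y.T:
        codes = [omp(D_c, y, k) for D_c in class_dicts]
        feats.append(np.concatenate(codes))
    return np.array(feats)

# Toy data (assumption): 3 classes, 8 unit-norm atoms per class, 16-dimensional signals.
rng = np.random.default_rng(0)
n_classes, dim, atoms_per_class, k = 3, 16, 8, 2
class_dicts = []
for _ in range(n_classes):
    D = rng.standard_normal((dim, atoms_per_class))
    class_dicts.append(D / np.linalg.norm(D, axis=0))
Y = rng.standard_normal((dim, 5))

F = class_dedicated_features(class_dicts, Y, k)
print(F.shape)
```

Each row of `F` holds one signal's concatenated codes, with at most `k` nonzero entries per class block. These rows would then be passed to a linear SVM, which is omitted here.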

V. Conclusion 

We have presented an information-theoretic approach to
dictionary learning that seeks a dictionary that is compact, 
reconstructive and discriminative for the task of image classi- 
fication. The algorithm consists of dictionary selection and 
update stages. In the selection stage, an objective function 
is maximized using a greedy procedure to select a set of 
compact, reconstructive and discriminative atoms from an 
initial dictionary. In the update stage, a gradient ascent al- 
gorithm based on the quadratic mutual information is adopted 
to enhance the selected dictionary for improved reconstruction 
and discrimination. Both the proposed dictionary selection and
update methods can be easily applied to other dictionary
learning schemes.
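
The greedy selection stage summarized above follows the usual forward-selection pattern, sketched below in Python. The objective used here, the log-determinant of a regularized Gram matrix, is only a stand-in chosen to make the example runnable; it is not the mutual-information objective of the proposed method, and the names `greedy_select` and `logdet_score` are hypothetical.

```python
import numpy as np

def greedy_select(D, n_select, score):
    """Select n_select columns of D, adding at each step the atom with the largest gain."""
    selected = []
    remaining = list(range(D.shape[1]))
    for _ in range(n_select):
        base = score(D, selected)
        # Marginal gain of adding each remaining atom to the current selection.
        gains = [score(D, selected + [j]) - base for j in remaining]
        best = remaining[int(np.argmax(gains))]
        selected.append(best)
        remaining.remove(best)
    return selected

def logdet_score(D, idx, eps=1e-3):
    """Stand-in objective (assumption): log-determinant of the regularized Gram matrix."""
    if not idx:
        return 0.0
    G = D[:, idx].T @ D[:, idx] + eps * np.eye(len(idx))
    _, logdet = np.linalg.slogdet(G)
    return logdet

# Toy initial dictionary (assumption): atoms 0 and 1 are identical, atom 2 is orthogonal.
D0 = np.array([[1.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
chosen = greedy_select(D0, 2, logdet_score)
print(chosen)
```

With this stand-in objective, the second pick skips the duplicate atom because adding it makes the Gram matrix nearly singular. Any other set function can be passed as `score` without changing the selection loop.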

Acknowledgment 

This work was partially supported by a MURI from the
Office of Naval Research under Grant N00014-10-1-0934.

References 

[1] R. Rubinstein, A. Bruckstein, and M. Elad, "Dictionaries for sparse
representation modeling," Proceedings of the IEEE, vol. 98, no. 6, pp.
1045-1057, June 2010.

[2] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan, "Sparse
representation for computer vision and pattern recognition," Proceedings
of the IEEE, vol. 98, no. 6, pp. 1031-1044, June 2010.

[3] M. Elad, M. Figueiredo, and Y. Ma, "On the role of sparse and redundant
representations in image processing," Proceedings of the IEEE, vol. 98,
no. 6, pp. 972-982, June 2010.

[4] V. M. Patel and R. Chellappa, "Sparse representations, compressive 
sensing and dictionaries for pattern recognition," in Asian Conference 
on Pattern Recognition (ACPR), Beijing, China, 2011. 



[5] S. Chen, D. Donoho, and M. Saunders, "Atomic decomposition by basis 
pursuit," SIAM J. Sci. Comp., vol. 20, no. 1, pp. 33-61, 1998. 

[6] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, "Orthogonal matching 
pursuit: recursive function approximation with applications to wavelet 
decomposition," Proc. 27th Asilomar Conference on Signals, Systems 
and Computers, pp. 40-44, 1993. 

[7] J. A. Tropp, "Greed is good: Algorithmic results for sparse approxima- 
tion," IEEE Trans. Info. Theory, vol. 50, no. 10, pp. 2231-2242, Oct. 
2004. 

[8] B. A. Olshausen and D. J. Field, "Emergence of simple-cell receptive
field properties by learning a sparse code for natural images," Nature,
vol. 381, no. 6583, pp. 607-609, 1996.

[9] V. M. Patel, T. Wu, S. Biswas, P. J. Phillips, and R. Chellappa,
"Dictionary-based face recognition under variable lighting and pose,"
IEEE Transactions on Information Forensics and Security, vol. 7, no. 3,
pp. 954-965, June 2012.

[10] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for
designing overcomplete dictionaries for sparse representation," IEEE
Trans. on Signal Processing, vol. 54, no. 11, pp. 4311-4322, Nov. 2006.

[11] K. Etemad and R. Chellappa, "Separability-based multiscale basis
selection and feature extraction for signal and image classification,"
IEEE Trans. on Image Processing, vol. 7, no. 10, pp. 1453-1465, Oct.
1998.

[12] M. Yang, L. Zhang, X. Feng, and D. Zhang, "Fisher discrimination
dictionary learning for sparse representation," in Proc. Intl. Conf. on Computer
Vision, Barcelona, Spain, 2011.

[13] F. Rodriguez and G. Sapiro, "Sparse representations for image clas- 
sification: Learning discriminative and reconstructive non-parametric 
dictionaries," Tech. Report, University of Minnesota, Dec. 2007. 

[14] E. Kokiopoulou and P. Frossard, "Semantic coding by supervised 
dimensionality reduction," IEEE Trans. Multimedia, vol. 10, no. 5, pp. 
806-818, Aug. 2008. 

[15] K. Huang and S. Aviyente, "Sparse representation for signal classifica- 
tion," in Neural Information Processing Systems, Vancouver, Canada, 
Dec. 2007. 

[16] J. Mairal, F. Bach, and J. Ponce, "Task-driven dictionary learning," IEEE
TPAMI, vol. 34, no. 4, pp. 791-804, April 2012.

[17] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Discriminative
learned dictionaries for local image analysis," in IEEE Computer Society
Conf. on Computer Vision and Patt. Recn., Anchorage, 2008.

[18] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Supervised 
dictionary learning," in Neural Information Processing Systems, Vancou- 
ver, Canada, Dec. 2008. 

[19] Q. Zhang and B. Li, "Discriminative K-SVD for dictionary learning in
face recognition," in Proc. IEEE Computer Society Conf. on Computer
Vision and Patt. Recn., San Francisco, CA, June 2010.

[20] Z. Jiang, Z. Lin, and L. S. Davis, "Learning a discriminative dictionary
for sparse coding via label consistent K-SVD," in IEEE Computer Society
Conf. on Computer Vision and Patt. Recn., Colorado Springs, June 2011.

[21] Q. Qiu, Z. Jiang, and R. Chellappa, "Sparse dictionary-based representa- 
tion and recognition of action attributes," in Proc. Intl. Conf. Computer 
Vision, Barcelona, Spain, Nov. 2011. 

[22] J. A. Tropp, A. C. Gilbert, and M. J. Strauss, "Algorithms for simultaneous
sparse approximation, Part I: Greedy pursuit," Signal Processing,
vol. 86, pp. 572-588, 2006.

[23] A. Krause, A. Singh, and C. Guestrin, "Near-optimal sensor placements
in Gaussian processes: Theory, efficient algorithms and empirical studies,"
JMLR, vol. 9, pp. 235-284, 2008.

[24] M. E. Hellman and J. Raviv, "Probability of error, equivocation, and the
Chernoff bound," IEEE Trans. on Info. Theory, vol. 16, pp. 368-372,
1970.

[25] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 
2006. 

[26] K. Torkkola, "Feature extraction by non parametric mutual information 
maximization," JMLR, vol. 3, pp. 1415-1438, Mar. 2003. 

[27] J. Kapur, Measures of Information and Their Applications. Wiley, 1994.

[28] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From
few to many: Illumination cone models for face recognition under
variable lighting and pose," IEEE Trans. Pattern Analysis and Machine
Intelligence, vol. 23, no. 6, pp. 643-660, June 2001.

[29] "USPS handwritten digit database,"
http://www-i6.informatik.rwth-aachen.de/~keysers/usps.html.

[30] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial
pyramid matching for recognizing natural scene categories," in IEEE
Computer Society Conf. on Computer Vision and Patt. Recn., New York,
NY, vol. 2, 2006, pp. 2169-2178.

[31] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid
matching using sparse coding for image classification," in Proc. IEEE
Computer Society Conf. on Computer Vision and Patt. Recn., Miami, FL,
June 2009.

[32] J. C. Gemert, J.-M. Geusebroek, C. J. Veenman, and A. W. Smeulders,
"Kernel codebooks for scene categorization," in Proc. European Conf.
on Computer Vision, Marseille, France, Oct. 2008.

[33] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face
recognition via sparse representation," IEEE TPAMI, vol. 31, no. 2, pp.
210-227, 2009.

[34] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, "Locality- 
constrained linear coding for image classification," in Proc. IEEE Com- 
puter Society Conf. on Computer Vision and Patt. Recn., San Francisco, 
June 2010. 



